[SOUND].
This lecture is about the syntagmatic
relation discovery and mutual information.
In this lecture we are going to continue
discussing syntagmatic relation discovery.
In particular,
we are going to talk about another
the concept in the information series,
we called it mutual information and
how it can be used to discover
syntagmatic relations.
Before we talked about the problem
of conditional entropy and
that is the conditional entropy
computed different pairs of words.
It is not really comparable, so
that makes it harder with this cover,
strong synagmatic relations
globally from corpus.
So now we are going to introduce mutual
information, which is another concept
in the information series
that allows us to, sometimes,
normalize the conditional entropy to make
it more comparable across different pairs.
In particular, mutual information
in order to find I(X:Y),
matches the entropy reduction
of X obtained from knowing Y.
More specifically the question we
are interested in here is how much
of an entropy of X can
we obtain by knowing Y.
So mathematically it can be
defined as the difference between
the original entropy of X, and
the condition of Y of X given Y.
And you might see,
as you can see here it can also be defined
as reduction of entropy of
Y because of knowing X.
Now normally the two conditional
interface H of X given Y and
the entropy of Y given X are not equal,
but interestingly,
the reduction of entropy by knowing
one of them, is actually equal.
So, this quantity is called a Mutual
Information in order to buy I here.
And this function has some interesting
properties, first it is also non-negative.
This is easy to understand because
the original entropy is always
not going to be lower than the possibility
reduced conditional entropy.
In other words, the conditional entropy
will never exceed the original entropy.
Knowing some information can
always help us potentially, but
will not hurt us in predicting x.
The signal property is that it
is symmetric like additional
entropy is not symmetrical,
mutual information is, and
the third property is that It
reaches its minimum, zero, if and
only if the two random variables
are completely independent.
That means knowing one of them does not
tell us anything about the other and
this last property can be verified by
simply looking at the equation above and
it reaches 0 if and
only the conditional entropy of X
[INAUDIBLE] Y is exactly the same
as original entropy of X.
So that means knowing why it did not
help at all and that is when X and
a Y are completely independent.
Now when we fix X to rank different
Ys using conditional entropy
would give the same order as
ranking based on mutual information
because in the function here,
H(X) is fixed because X is fixed.
So ranking based on mutual entropy is
exactly the same as ranking based on
the conditional entropy of X given Y, but
the mutual information allows us to
compare different pairs of x and y.
So, that is why mutual information is
more general and in general, more useful.
So, let us examine the intuition
of using mutual information for
Syntagmatical Relation Mining.
Now, the question we ask forcing
that relation mining is,
whenever "eats" occurs,
what other words also tend to occur?
So this question can be framed as
a mutual information question, that is,
which words have high mutual
information was eats,
so computer the missing information
between eats and other words.
And if we do that, and it is basically
a base on the same as conditional
we will see that words that
are strongly associated with eats,
will have a high point.
Whereas words that are not related
will have lower mutual information.
For this, I will give some example here.
The mutual information between "eats" and
"meats",
which is the same as between "meats" and
"eats," because the information is
symmetrical is expected to be higher than
the mutual information between eats and
the, because knowing the does not
really help us as a predictor.
It is similar, and
knowing eats does not help us predicting,
the as well.
And you also can easily
see that the mutual
information between a word and
itself is the largest,
which is equal to
the entropy of this word and
so, because in this case the reduction is
maximum because knowing one allows
us to predict the other completely.
So the conditional entropy is zero,
therefore the mutual information
reaches its maximum.
It is going to be larger, then are equal
to the machine volume eats in other words.
In other words picking any other word and
the computer picking between eats and
that word.
You will not get any information larger
the computation from eats and itself.
So now let us look at how to
compute the mute information.
Now in order to do that, we often
use a different form of mutual
information, and we can mathematically
rewrite the mutual information
into the form shown on this slide.
Where we essentially see
a formula that computes what is
called a KL-divergence or divergence.
This is another term
in information theory.
It measures the divergence
between two distributions.
Now, if you look at the formula,
it is also sum over many combinations of
different values of the two random
variables but inside the sum,
mainly we are doing a comparison
between two joint distributions.
The numerator has the joint,
actual observed the joint distribution
of the two random variables.
The bottom part or the denominator can be
interpreted as the expected joint
distribution of the two random variables,
if they were independent because when
two random variables are independent,
they are joined distribution is equal to
the product of the two probabilities.
So this comparison will tell us whether
the two variables are indeed independent.
If they are indeed independent then we
would expect that the two are the same,
but if the numerator is different
from the denominator, that would mean
the two variables are not independent and
that helps measure the association.
The sum is simply to take into
consideration of all of the combinations
of the values of these
two random variables.
In our case, each random variable
can choose one of the two values,
zero or one, so
we have four combinations here.
If we look at this form of mutual
information, it shows that the mutual
information matches the divergence
of the actual joint distribution
from the expected distribution
under the independence assumption.
The larger this divergence is, the higher
the mutual information would be.
So now let us further look at what
are exactly the probabilities,
involved in this formula
of mutual information.
And here, this is all the probabilities
involve, and it is easy for
you to verify that.
Basically, we have first to
[INAUDIBLE] probabilities
corresponding to the presence or
absence of each word.
So, for w1,
we have two probabilities shown here.
They should sum to one, because a word
can either be present or absent.
In the segment, and similarly for
the second word, we also have two
probabilities representing presence or
absences of this word, and
there is some to y as well.
And finally, we have a lot of
joined probabilities that represent
the scenarios of co-occurrences of
the two words, and they are shown here.
And they sum to one because the two
words can only have these four
possible scenarios.
Either they both occur, so
in that case both variables will have
a value of one, or one of them occurs.
There are two scenarios.
In these two cases one of the random
variables will be equal to one and
the other will be zero and finally we have
the scenario when none of them occurs.
This is when the two variables
taking a value of zero.
So these are the probabilities involved
in the calculation of mutual information,
over here.
Once we know how to calculate
these probabilities,
we can easily calculate
the new gene formation.
It is also interesting to know that
there are actually some relations or
constraint among these probabilities,
and we already saw two of them, right?
So in the previous slide,
that you have seen that
the marginal probabilities of these
words sum to one and
we also have seen this constraint,
that says the two words have these
four scenarios of co-occurrency,
but we also have some additional
constraints listed in the bottom.
For example, this one means if we add up
the probabilities that we observe
the two words occur together and
the probabilities when the first word
occurs and the second word does not occur.
We get exactly the probability
that the first word is observed.
In other words, when the word is observed.
When the first word is observed, and
there are only two scenarios, depending on
whether the second word is also observed.
So, this probability captures the first
scenario when the second word
actually is also observed, and
this captures the second scenario
when the second word is not observed.
So, we only see the first word, and
it is easy to see the other equations
also follow the same reasoning.
Now these equations allow us to
compute some probabilities based on
other probabilities, and
this can simplify the computation.
So more specifically,
if we know the probability that
a word is present, like in this case,
so if we know this, and
if we know the probability of
the presence of the second word,
then we can easily compute
the absence probability, right?
It is very easy to use this
equation to do that, and so
we take care of the computation of
these probabilities of presence and
absence of each word.
Now let's look at
the [INAUDIBLE] distribution.
Let us assume that we also have available
the probability that
they occurred together.
Now it is easy to see that we can
actually compute all the rest of these
probabilities based on these.
Specifically for
example using this equation we can compute
the probability that the first word
occurred and the second word did not,
because we know these probabilities in
the boxes, and similarly using this
equation we can compute the probability
that we observe only the second word.
Word.
And then finally,
this probability can be calculated
by using this equation because
now this is known, and
this is also known, and
this is already known, right.
So this can be easier to calculate.
So now this can be calculated.
So this slide shows that we only
need to know how to compute
these three probabilities
that are shown in the boxes,
naming the presence of each word and the
co-occurence of both words, in a segment.
[MUSIC]

